Geographical analysis of media flows

A multidimensional approach

Claude Grasland (Université de Paris (Diderot), FR 2007 CIST, UMR 8504 Géographie-cités)


Introduction

1 CORPUS COLLECTION: WHO AND WHEN

1.1 Importation of RSS

1.1.1 The Mediacloud database

(tbd : presentation of the MediaCloud project)

Mediacloud can be freely used by researchers. All you have to do is create an account at the following address:

https://explorer.mediacloud.org

There are different ways to retrieve news titles. We will focus here on a simple example of data obtained through the MediaCloud interface. Suppose that you want to extract news from Tunisian newspapers speaking about Europe.

1.1.2 Selection of media with source manager

We use the application called Source Manager and run a search by collection, which is the most convenient way to explore what is available in a country. In our example, the target country is Tunisia and three collections are proposed:

We have selected the collection named “Tunisia National” because we are interested in the most important newspapers of the country.

The bubble chart on the right immediately indicates the media that have produced the highest number of news items, but it is wise to explore in more detail the list on the left, which indicates for each medium the starting date of data collection.

When a medium appears interesting, we click on its name to obtain a brief summary of the metadata. For example, in the case of L’économiste Maghrebin the metadata indicates:

The medium looks promising but, before going further, it is better to have a look at its website to get a more concrete idea of the content and its ideological orientation, if we do not know them in advance.

Here we can see that this is an economic journal, published in French, with news organized in concentric geographic circles (Nation > Maghreb > Africa > World), which is precisely what we are looking for in the IMAGEUN project. We will complete the information about this medium later but, before doing that, we have to check in more detail whether its production is regular through time, with another tool offered by MediaCloud, the explorer.

1.1.3 Checking the stability through time

We click on search in explorer on the metadata page of the Source Manager and obtain a new interface where we modify the dates to cover the full period of collection of the medium (or our period of interest). In the search field, we leave the search term *, which requests all news.

Below your request, you obtain a graphic entitled Attention Over Time with the distribution of the number of news items published per day, which helps you to verify whether the distribution of news is regular through time. You just have to modify the type of graphic in order to visualize Story Count, and you can choose the time span you want (day, week or month) for the evaluation of the regularity of the news flow. In our example, we notice some brief periods of interruption in 2019 at the daily level, but the flow is reasonably regular, with approximately 5 news items per day at the beginning and 10 to 20 in the final period. We also notice a classical weekly cycle, with a decrease in the number of news items published during the weekend.

Going down, you will find a new panel entitled Total Attention which gives you the total number of stories found. In our example, we have a total of 13626 stories produced by our medium over the period.
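
The same regularity check can later be reproduced locally once the data are downloaded. Below is a minimal sketch in base R, assuming a data frame with a publish_date column like the one exported by MediaCloud (the dates here are invented for illustration):

```r
# Count the number of stories per day from a publish_date column
# (synthetic example; a real MediaCloud export has the same field)
df <- data.frame(publish_date = c("2019-01-02 03:42:46",
                                  "2019-01-02 04:06:27",
                                  "2019-01-03 06:05:08"))
df$day <- as.Date(df$publish_date)
stories_per_day <- table(df$day)
stories_per_day
# A simple plot of the daily flow could then be obtained with:
# plot(as.Date(names(stories_per_day)), as.numeric(stories_per_day), type = "h")
```

This is only a local cross-check of the Attention Over Time graphic, not a replacement for it.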

1.1.5 Download and storage of news

Depending on your selection (all news or a specific topic) you will download more or fewer titles. Here, we make the choice to get all news, which means that we repeat the original request with *.

Finally, by clicking on the button Download all story URLS, you can get a .csv file that you can easily load into your favorite programming language, as we will see in the next section.

1.2 Corpus creation

knitr::opts_chunk$set(cache = TRUE,
                        echo = TRUE,
                        comment = "")

In the previous section (ref…) we have obtained a .csv file of news collected from MediaCloud. We will now try to put these data into a standard form, and we have chosen the format of the quanteda package as the reference for data organization and storage.

But of course the researchers involved in the project may prefer to use other R packages like tm or tidytext. They may also prefer to use another programming language such as Python. This is the reason why we explain how to transform and export the data that has been prepared and harmonized with quanteda into various formats like .csv or JSON.

We detail below the importation procedure with the example of the newspaper “L’économiste maghrebin”.

1.2.1 Importation of text to R

This step is not always straightforward because many encoding problems can appear that are more or less easy to solve. In principle, the data from Media Cloud are exported in standard UTF-8 but, as we will see, this is not necessarily the case.

We first try to use the standard R function read.csv():

store <- "data"
  media <- "fr_TUN_ecomag"
  type <-".csv"
  
  fic <- paste(store,"/",media,type,sep="")
  
  df<-read.csv(fic,
               sep=",",
               header=T,
               encoding = "UTF-8",
               stringsAsFactors = F)
  kable(head(df))
stories_id publish_date title url language ap_syndicated themes media_id media_name media_url
1129295780 2019-01-02 03:42:46 Les tarifs de l’ADSL réduits à partir du 1er janvier 2019 https://www.leconomistemaghrebin.com/2019/01/02/tarifs-adsl-reduits-1-janvier-2019/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129295771 2019-01-02 04:06:27 6ème Sfax Marathon International des Oliviers https://www.leconomistemaghrebin.com/2019/01/02/sfax-marathon-international-oliviers/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129295760 2019-01-02 06:05:08 Télécharger la version finale de la Loi de finances 2019 https://www.leconomistemaghrebin.com/2019/01/02/telecharger-la-version-finale-de-la-loi-de-finances-2019/ en False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129578051 2019-01-02 10:05:06 Chawki Tabib : 245 dossiers de corruption présumée transmis au ministère public https://www.leconomistemaghrebin.com/2019/01/02/chawki-tabib-245-dossiers-transferes-au-ministere-public/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129461662 2019-01-02 07:52:36 Panoro Energy finalise l’acquisition de OMV Tunisia https://www.leconomistemaghrebin.com/2019/01/02/panoro-energy-finalise-lacquisition-de-omv-tunisia/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129461636 2019-01-02 08:57:54 La partie syndicale maintient le boycott des examens du secondaire https://www.leconomistemaghrebin.com/2019/01/02/partie-syndicale-boycott-examens-secondaire/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/

The importation was successful for 12794 news items, but for 3 news items R sent an error message:

Error in gregexpr(calltext, singleline, fixed = TRUE) : regular expression is invalid UTF-8

Looking in more detail, we also discover some encoding problems in the news, as in the following example where the text appears differently depending on whether we apply the standard function paste() or the specialized function knitr::kable() for printing.

paste(df[9, 3])
[1] "Néji Jalloul : &#8220;Nidaa Tounes peut revenir si&#8230;&#8221;"
kable((df[9,3]))
x
Néji Jalloul : “Nidaa Tounes peut revenir si…”

1.2.2 Resolution of encoding problems

It is sometimes possible to fix the encoding problems manually when they are not too numerous, as in the present example.

df$text<-df$title
  # standardize apostrophe
  df$text<-gsub("&#8217;","'",df$text)
  
  # standardize punct
  df$text<-gsub('&#8230;','.',df$text)
  
  # standardize hyphens
  df$text<-gsub('&#8211;','-',df$text)
  
  # Remove quotation marks
  df$text<-gsub('&#171;&#160;','',df$text)
  df$text<-gsub('&#160;&#187;','',df$text)
  df$text<-gsub('&#8220;','',df$text)
  df$text<-gsub('&#8221;','',df$text)
  df$text<-gsub('&#8216;','',df$text)
  df$text<-gsub('&#8243;','',df$text)

We can introduce other cleaning procedures here or keep them for a later stage of analysis.
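
The repeated gsub() calls above can also be gathered into a single named lookup table, which makes the list of HTML entities easier to maintain. This is only a sketch of an alternative organization of the same cleaning step, with the function name decode_entities chosen for illustration:

```r
# Decode a set of numeric HTML entities with one named replacement vector
html_entities <- c("&#8217;" = "'",   # apostrophe
                   "&#8230;" = ".",   # ellipsis
                   "&#8211;" = "-",   # hyphen
                   "&#8220;" = "",    # opening quotation mark
                   "&#8221;" = "")    # closing quotation mark

decode_entities <- function(x, dict = html_entities) {
  for (ent in names(dict)) x <- gsub(ent, dict[[ent]], x, fixed = TRUE)
  x
}

decode_entities("Nidaa Tounes peut revenir si&#8230;")
```

Adding a new entity then only requires one new element in the vector, not a new gsub() line.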

1.2.3 Transformation in quanteda format

We propose a storage based on the quanteda format, obtained by simply transforming the data frame produced in the previous step. We keep only the name of the source and the date of publication.

# Create Quanteda corpus
  qd<-corpus(df,docid_field = "stories_id")
  
  
  # Select docvar fields and rename media
  qd$date <-as.Date(qd$publish_date)
  qd$source <-media
  docvars(qd)<-docvars(qd)[,c("source","date")]
  
  
  
  
  # Add global meta
  meta(qd,"meta_source")<-"Media Cloud "
  meta(qd,"meta_time")<-"Download the 2021-09-30"
  meta(qd,"meta_author")<-"Elaborated by Claude Grasland"
  meta(qd,"project")<-"ANR-DFG Project IMAGEUN"

We have created a quanteda object with a lot of information stored in various fields. The structure of the object is the following:

str(qd)
 'corpus' Named chr [1:12794] "Les tarifs de l'ADSL réduits à partir du 1er janvier 2019" ...
   - attr(*, "names")= chr [1:12794] "1129295780" "1129295771" "1129295760" "1129578051" ...
   - attr(*, "docvars")='data.frame': 12794 obs. of  5 variables:
    ..$ docname_: chr [1:12794] "1129295780" "1129295771" "1129295760" "1129578051" ...
    ..$ docid_  : Factor w/ 12794 levels "1129295780","1129295771",..: 1 2 3 4 5 6 7 8 9 10 ...
    ..$ segid_  : int [1:12794] 1 1 1 1 1 1 1 1 1 1 ...
    ..$ source  : chr [1:12794] "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" ...
    ..$ date    : Date[1:12794], format: "2019-01-02" "2019-01-02" ...
   - attr(*, "meta")=List of 3
    ..$ system:List of 6
    .. ..$ package-version:Classes 'package_version', 'numeric_version'  hidden list of 1
    .. .. ..$ : int [1:3] 3 0 0
    .. ..$ r-version      :Classes 'R_system_version', 'package_version', 'numeric_version'  hidden list of 1
    .. .. ..$ : int [1:3] 4 1 0
    .. ..$ system         : Named chr [1:3] "Windows" "x86-64" "claude"
    .. .. ..- attr(*, "names")= chr [1:3] "sysname" "machine" "user"
    .. ..$ directory      : chr "C:/git/geomedia"
    .. ..$ created        : Date[1:1], format: "2021-11-26"
    .. ..$ source         : chr "data.frame"
    ..$ object:List of 2
    .. ..$ unit   : chr "documents"
    .. ..$ summary:List of 2
    .. .. ..$ hash: chr(0) 
    .. .. ..$ data: NULL
    ..$ user  :List of 4
    .. ..$ meta_source: chr "Media Cloud "
    .. ..$ meta_time  : chr "Download the 2021-09-30"
    .. ..$ meta_author: chr "Elaborated by Claude Grasland"
    .. ..$ project    : chr "ANR-DFG Project IMAGEUN"

We can look at the first titles with head():

kable(head(qd,3))
x
1129295780 Les tarifs de l’ADSL réduits à partir du 1er janvier 2019
1129295771 6ème Sfax Marathon International des Oliviers
1129295760 Télécharger la version finale de la Loi de finances 2019

We can get meta information on each story with summary():

summary(head(qd,3))
Corpus consisting of 3 documents, showing 3 documents:

         Text Types Tokens Sentences        source       date
   1129295780    11     11         1 fr_TUN_ecomag 2019-01-02
   1129295771     6      6         1 fr_TUN_ecomag 2019-01-02
   1129295760     8     10         1 fr_TUN_ecomag 2019-01-02

We can get meta information about the full document with meta():

meta(qd)
$meta_source
  [1] "Media Cloud "

  $meta_time
  [1] "Download the 2021-09-30"

  $meta_author
  [1] "Elaborated by Claude Grasland"

  $project
  [1] "ANR-DFG Project IMAGEUN"

1.2.4 Storage of the quanteda object

We can finally save the object in .RDS format in a directory dedicated to our quanteda files. It can be useful to include some information in the name of the file:

store <- "data"
  type<- ".RDS"
  myfile <- paste(store,"/",media,type,sep="")
  myfile
[1] "data/fr_TUN_ecomag.RDS"
saveRDS(qd,myfile)
  qd[1:3]
Corpus consisting of 3 documents and 2 docvars.
  1129295780 :
  "Les tarifs de l'ADSL réduits à partir du 1er janvier 2019"

  1129295771 :
  "6ème Sfax Marathon International des Oliviers"

  1129295760 :
  "Télécharger la version finale de la Loi de finances 2019"
summary(qd,3)
Corpus consisting of 12794 documents, showing 3 documents:

         Text Types Tokens Sentences        source       date
   1129295780    11     11         1 fr_TUN_ecomag 2019-01-02
   1129295771     6      6         1 fr_TUN_ecomag 2019-01-02
   1129295760     8     10         1 fr_TUN_ecomag 2019-01-02

We have kept all the information present in the initial file and added specific metadata of interest for us. The size of the stored file is now 0.6 Mb, a six-fold reduction compared to the initial .csv file downloaded from Media Cloud, whose size was 3.8 Mb.
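
The gain in storage can be checked directly in R. Below is a minimal sketch on a throwaway object, using temporary files rather than the project's own paths:

```r
# Save an object as .RDS, read it back and compare sizes on disk
x <- data.frame(text = rep("Les tarifs de l'ADSL reduits", 1000))
csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".RDS")

write.csv(x, csv_file, row.names = FALSE)
saveRDS(x, rds_file)

y <- readRDS(rds_file)                     # the round trip preserves the data
file.size(csv_file); file.size(rds_file)   # .RDS is compressed by default
```

The compression explains most of the size reduction observed above; repetitive text compresses particularly well.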

1.2.5 Back transformation to tibble

In the following steps, we will make intensive use of quanteda, but sometimes it can be useful to export the results in a more practical format or to use other packages. For this reason, it is important to know that the tidytext package can easily transform quanteda objects into tibbles, which are more classical, easier to manage, and easy to export to other formats like data.frame or data.table.

td <- tidy(qd)
  kable(head(td))
text source date
Les tarifs de l’ADSL réduits à partir du 1er janvier 2019 fr_TUN_ecomag 2019-01-02
6ème Sfax Marathon International des Oliviers fr_TUN_ecomag 2019-01-02
Télécharger la version finale de la Loi de finances 2019 fr_TUN_ecomag 2019-01-02
Chawki Tabib : 245 dossiers de corruption présumée transmis au ministère public fr_TUN_ecomag 2019-01-02
Panoro Energy finalise l’acquisition de OMV Tunisia fr_TUN_ecomag 2019-01-02
La partie syndicale maintient le boycott des examens du secondaire fr_TUN_ecomag 2019-01-02
str(td)
tibble [12,794 x 3] (S3: tbl_df/tbl/data.frame)
   $ text  : chr [1:12794] "Les tarifs de l'ADSL réduits à partir du 1er janvier 2019" "6ème Sfax Marathon International des Oliviers" "Télécharger la version finale de la Loi de finances 2019" "Chawki Tabib : 245 dossiers de corruption présumée transmis au ministère public" ...
   $ source: chr [1:12794] "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" ...
   $ date  : Date[1:12794], format: "2019-01-02" "2019-01-02" ...
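
From such a tibble (or plain data frame), exporting to other formats is straightforward. Below is a sketch with base R only; jsonlite::toJSON would play the same role for JSON, if that package is available:

```r
# Export a small news table to .csv and read it back
td <- data.frame(text   = "6eme Sfax Marathon International des Oliviers",
                 source = "fr_TUN_ecomag",
                 date   = as.Date("2019-01-02"))

out <- tempfile(fileext = ".csv")
write.csv(td, out, row.names = FALSE, fileEncoding = "UTF-8")

back <- read.csv(out, encoding = "UTF-8")
back$date <- as.Date(back$date)   # dates come back as character strings
```

Note that the Date class is lost in the .csv round trip and has to be restored explicitly, which is one reason to keep .RDS as the working format.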

2 DICTIONARIES AND TAGS: WHAT AND WHERE

The objective of this section is to explore the possibilities offered by Wikipedia and related tools (Wikidata, Wikimedia, …) for the production of multilingual dictionaries dedicated to the identification of the geographical objects (WHERE) and the different topics (WHAT) that are mentioned in the news. The ambition is to produce dictionaries in different languages in order to check whether the results are really comparable and whether it is possible to elaborate a cross-language analysis of a corpus of media from different countries in different languages.

2.1 Wikipedia entities

Wikidata defines itself as:

  • a free and open knowledge base that can be read and edited by both humans and machines;
  • a central storage for the structured data of its Wikimedia sister projects, including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others;
  • a support to many other sites and services beyond just Wikimedia projects. The content of Wikidata is available under a free license, exported using standard formats, and can be interlinked to other open data sets on the linked data web.

2.1.1 Codification of entities

The first interest of Wikidata is to provide unique identification codes for objects. For example, a search for “Africa” will produce a list of different objects, each characterized by a unique code:

knitr::include_graphics("figures/Wikidata001.png")

2.1.2 Information on entities

Once we have selected an entity (e.g. Q15) we obtain a new page with more detailed information in English but also in all the other languages available in Wikipedia.

knitr::include_graphics("figures/Wikidata002.png")

A lot of information is available concerning the entity but, at this stage, the most important items for our research are:

  1. the translation in different languages
  2. the equivalent words or expression in different languages
  3. the definitions in different languages
  4. the ambiguity of the term in each language and the potential risks of confusion with other entities.

Of course we should not take the answers proposed by Wikidata for granted (as noticed by Georg, Wikipedia is itself a matter of research for IMAGEUN…) but, without any doubt, it offers a very good opportunity to clarify our questions and helps us to build tools for the recognition of world regions and other geographical imaginations in a multilingual perspective.

2.1.3 Wikipedia entities as nodes of an ontology

It is crucial here to introduce a clear distinction between Wikipedia entities and the textual units associated with the names and definitions of these entities.

A Wikipedia entity like Q15 is an element of an ontology designed by its authors for specific purposes. The specificity of the Wikidata ontology is that it is a multilingual web where Q15 is a node present in different linguistic layers. It means that we do not have a single name or a single definition of Q15, unless we adopt the neocolonial perspective of choosing English as the reference language. Depending on the context (i.e. the language or sub-language), Q15 could be defined as:

  • (fr) : A “continent” named “Afrique”
  • (en) : A “continent on the Earth’s northern and southern hemispheres” named “Africa” or “African continent”
  • (de) : A “Kontinent auf der Nord- und Südhalbkugel der Erde” named “Afrika”
  • (tr) : A “Dünya’nın kuzey ve güney yarıkürelerindeki bir kıta” named “Afrika” or “Afrika kıtası”

In other words, the existence of a common code for Wikipedia entities does not offer any guarantee of concordance between the geographical objects found in news published in different languages or different countries. But, and this is the important point, it helps us to point out similarities and differences between sets of geographical entities that are more or less comparable in each language.

2.1.4 A tool for cross-linguistical experiments

Having in mind the limits of the equivalence of entities across languages, it can nevertheless be an interesting experiment to select a set of Wikipedia entities (Q15, Q258, Q4412, …) and to examine their relative frequency in our different media from different countries with different languages. A typical hypothesis could be something like:

  • Is Q15 mentioned more than Q46 in Tunisian newspapers?

which is not equivalent to the question

  • Is Africa mentioned more than Europe in Tunisian newspapers?
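
Once each title has been tagged with entity codes, such a hypothesis reduces to simple counting. Below is a minimal sketch with invented tags; the codes follow the Wikidata convention but the data are not real:

```r
# Compare the frequency of two entity codes in a small tagged corpus:
# each list element holds the entity codes found in one news title
tags <- list(c("Q15"),
             c("Q46"),
             c("Q15", "Q46"),
             character(0),
             c("Q15"))

n_q15 <- sum(vapply(tags, function(t) "Q15" %in% t, logical(1)))
n_q46 <- sum(vapply(tags, function(t) "Q46" %in% t, logical(1)))
n_q15 > n_q46   # TRUE: Q15 appears in more news items than Q46
```

Counting news items rather than raw occurrences avoids giving extra weight to titles that repeat the same entity.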

2.1.5 The package WikidataR

The package WikidataR is an interface to the Wikidata API in the R language. Equivalent tools are available in Python and other languages for those not familiar with R. And it is of course possible to use the API directly. The first step is to install the most recent version of the R package WikidataR, which also installs related packages of interest.

#install.packages("WikidataR")
  library(WikidataR)

(based on Etienne Toureille previous experiments)

2.1.6 Identification of entities of interest

The function find_item helps to find all the Wikipedia entities (= items) associated with a textual unit (a word or group of words) in a given language. Let’s start with the search for entities associated with “Afrique” in French:

mytext <- "Afrique"
  
  items <- find_item(search_term = mytext,
                     language = "fr",
                     limit=30)
  class(items)
[1] "find_item"
length(items)
[1] 30

The resulting object is of class find_item, which is in practice a list describing the entities that have been recognized as associated with the textual unit that we have chosen. In the French case, we have found 30 entities that match our textual unit. Let’s have a look at the first one:

items[[1]]
$id
  [1] "Q15"

  $title
  [1] "Q15"

  $pageid
  [1] 111

  $repository
  [1] "wikidata"

  $url
  [1] "//www.wikidata.org/wiki/Q15"

  $concepturi
  [1] "http://www.wikidata.org/entity/Q15"

  $label
  [1] "Africa"

  $description
  [1] "continent on the Earth's northern and southern hemispheres"

  $match
  $match$type
  [1] "label"

  $match$language
  [1] "fr"

  $match$text
  [1] "Afrique"


  $aliases
  $aliases[[1]]
  [1] "Afrique"

As we can see, we can easily identify the code, the label and the description in English, but also the text responsible for the match in French. We can therefore create a function item_info that extracts all the elements of interest and puts them in a table in order to have a complete view.

item_info <- function(my_item){ 
    
  
      if (is.null(my_item$id) == F){item_id = my_item$id}
          else {item_id  = NA}
    
      if (is.null(my_item$label) ==F){item_label = my_item$label}
          else {item_label  = NA}
    
      if (is.null(my_item$desc) == F) {item_desc= my_item$desc}
          else {item_desc  = NA}
    
      if (is.null(my_item$match$lang) ==F){item_lang = my_item$match$lang}
          else {item_lang  = NA}
    
      if (is.null(my_item$match$text) ==F){item_text = my_item$match$text}
          else {item_text  = NA}
  
    
    res<-data.frame(item_id,item_label,item_desc,item_lang,item_text)
    
    return(res) 
    }

For example:

item_info(items[[1]])
  item_id item_label                                                  item_desc
  1     Q15     Africa continent on the Earth's northern and southern hemispheres
    item_lang item_text
  1        fr   Afrique

We then build a second function that extracts all the Wikipedia entities associated with a textual unit for a given language:

extract_entities <- function(mytext= "Afrique",
                               mylang = "fr",
                               maxres = 20) {
    # Extract items
    items <- find_item(search_term =  mytext,
                     language      =  mylang,
                     limit         =  maxres)
    
    # Create empty dataset
    res<-data.frame()
    res$item_id    <- as.character()
    res$item_label <- as.character()
    res$item_desc <- as.character()
    res$item_lang  <- as.character()
    res$item_text  <- as.character()
    
    # Fill dataset
    k<-length(items)
        for (i in 1:k) {
             res <- rbind(res,item_info(items[[i]]))
        }
    
    # Return dataset
    return(res)
  
  }

For example:

tab <- extract_entities("Afrique","fr",20)
  kable(tab)
item_id item_label item_desc item_lang item_text
Q15 Africa continent on the Earth’s northern and southern hemispheres fr Afrique
Q181238 Africa Roman province on the northern African coast covering parts of present-day Tunisia, Algeria, and Libya fr Afrique
Q203548 African Plate continental plate underlying Africa fr Afrique
Q258 South Africa sovereign state in Southern Africa fr Afrique du Sud
Q27433 Central Africa core region of the African continent fr Afrique centrale
Q4412 West Africa region of Africa fr Afrique de l’Ouest
Q132959 Sub-Saharan Africa area of the continent of Africa that lies south of the Sahara Desert fr Afrique subsaharienne
Q27394 Southern Africa southernmost region of the African continent fr Afrique australe
Q27407 East Africa easterly region of the African continent fr Afrique de l’Est
Q27381 North Africa northernmost region of the African continent fr Afrique du Nord
Q2826196 Afrique Wikimedia disambiguation page fr Afrique
Q23639892 Africa artwork by Eugène Delaplanche in Paris, France fr Afrique
Q66022909 Afrique NA fr Afrique
Q153963 German East Africa former German possession in the African Great Lakes region between 1884–1919 fr Afrique orientale allemande
Q4690138 Afrique album by Count Basie fr Afrique
Q65574303 Afrique NA fr Afrique
Q56317928 Afrique NA fr Afrique
Q210682 French West Africa French colonial federation (1895–1958) fr Afrique-Occidentale française
Q106179043 Afrique NA en Afrique
Q271894 French Equatorial Africa federation of French colonial possessions in Central Africa fr Afrique-Équatoriale française

As we can see, many of the entities proposed in the list are not relevant, and we will probably have to select the entities of interest one by one. But we clearly have to keep two different lists of entities:

  • the target entities, that we consider as potential world regions or candidates for the title of “geographic imagination”;
  • the control entities, that we have to identify or eliminate if we want to identify our target entities correctly, like the country of South Africa.

In the case of Africa, we could for example establish a more limited list:

entit <- c("Q15", "Q4412","Q132959", "Q27394","Q27407","Q27381","Q27433","Q258")
  
  tab<-tab %>% filter(item_id %in% entit)
  kable(tab)
item_id item_label item_desc item_lang item_text
Q15 Africa continent on the Earth’s northern and southern hemispheres fr Afrique
Q258 South Africa sovereign state in Southern Africa fr Afrique du Sud
Q27433 Central Africa core region of the African continent fr Afrique centrale
Q4412 West Africa region of Africa fr Afrique de l’Ouest
Q132959 Sub-Saharan Africa area of the continent of Africa that lies south of the Sahara Desert fr Afrique subsaharienne
Q27394 Southern Africa southernmost region of the African continent fr Afrique australe
Q27407 East Africa easterly region of the African continent fr Afrique de l’Est
Q27381 North Africa northernmost region of the African continent fr Afrique du Nord

But this list, which was based on the French textual units associated with “Afrique”, should certainly be completed by equivalent lists established for other languages with different seeds (“Africa” in English, “Afrika” in German, …).
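
The tables obtained with different seeds and languages can then be stacked and deduplicated by entity code. Below is a sketch where two toy tables stand in for real extract_entities() results:

```r
# Combine entity tables from two language seeds, keep unique (id, lang) pairs
tab_fr <- data.frame(item_id   = c("Q15", "Q258"),
                     item_lang = "fr",
                     item_text = c("Afrique", "Afrique du Sud"))
tab_en <- data.frame(item_id   = c("Q15", "Q4412"),
                     item_lang = "en",
                     item_text = c("Africa", "West Africa"))

all_tabs <- rbind(tab_fr, tab_en)
all_tabs <- all_tabs[!duplicated(all_tabs[, c("item_id", "item_lang")]), ]
sort(unique(all_tabs$item_id))   # entity codes covered by at least one seed
```

The union of the codes gives the candidate list; codes found in only one language may signal either a gap in a seed or a genuinely language-specific imagination.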

2.1.7 Elaboration of a cross-linguistic dictionary

Assuming that we have established a list of Wikipedia entities of interest, we can now turn to the creation of a dictionary for the identification of these entities in different languages. We will use for that purpose the powerful function get_property:

item_prop <- get_property("Q15")[[1]]

The result is a very large object (a list of lists) which provides all the information (or links toward this information) in all the languages where the object is available. The problem is therefore to understand the structure of this object and to extract exactly what we need. In our case, we want to extract the labels and descriptions for each language of interest.
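
The navigation itself is just nested list indexing. Below is a sketch on a mock object with the same shape as the labels/descriptions part of the result; the real object returned by get_property is much larger:

```r
# Mock of the labels/descriptions slots of a get_property() result
item_prop <- list(
  labels = list(
    fr = list(language = "fr", value = "Afrique"),
    en = list(language = "en", value = "Africa")
  ),
  descriptions = list(
    en = list(language = "en",
              value = "continent on the Earth's northern and southern hemispheres")
  )
)

item_prop[["labels"]][["fr"]]$value          # label in French
is.null(item_prop[["descriptions"]][["fr"]]) # TRUE: no French description here
```

The is.null() test is the pattern used in the extraction functions below to cope with slots that are missing for some languages.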

The information will be separated into two datasets:

  • a dictionary of definitions
  • a dictionary of labels and aliases

We create one function dedicated to each of these tasks:

extract_def <- function(item = c("Q15", "Q246"),
                          langs = c("fr","de","en","tr")) {
    # Create empty dataset
    res<-data.frame()
    res$id    <- as.character()
    res$lang  <- as.character()
    res$label <- as.character()
    res$desc  <- as.character()
    
    
    # Loop of items
    n <- length(item)
    for (i in 1:n) {
      
       # Extract item properties
      item_prop <- get_property(item[i])[[1]]
    
     
       # Loop  of language
       p<-length(langs)
       for (j in 1:p) {
          id <- item[i]
          lang  <- langs[j]
          if(is.null(item_prop[["labels"]][[lang]]$value)==F) {label <- item_prop[["labels"]][[lang]]$value}
             else { label <- NA}
          if(is.null(item_prop[["descriptions"]][[lang]]$value)==F) {desc <- item_prop[["descriptions"]][[lang]]$value}
             else { desc <- NA}
          add <-data.frame(id,lang,label,desc)
          res<- rbind(res,add) 
          }
    
    }
    # Export result
  return(res)
  
  }

The function works properly as long as the entities are available in all languages. It should be adapted to prevent errors when an entity is not available in one language.
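
One way to make the loop robust is to wrap the retrieval in tryCatch() so that a failing entity yields NULL instead of stopping the whole loop. The helper safe_call below is a generic sketch of this pattern, not part of WikidataR:

```r
# Return NULL instead of failing when a retrieval function errors
safe_call <- function(f, ...) {
  tryCatch(f(...), error = function(e) NULL)
}

ok  <- safe_call(sqrt, 4)                            # normal call: returns 2
bad <- safe_call(function(x) stop("missing"), "Q0")  # error is absorbed
is.null(bad)                                          # TRUE
```

Inside extract_def, the call get_property(item[i]) could be replaced by safe_call(get_property, item[i]) and a NULL result skipped with next.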

entit <- c("Q15", "Q4412","Q132959", "Q27394","Q27407","Q27381","Q27433","Q258")
  
  
  tab<-extract_def(entit,c("fr","de","tr","en"))
  kable(tab)
id lang label desc
Q15 fr Afrique continent
Q15 de Afrika Kontinent auf der Nord- und Südhalbkugel der Erde
Q15 tr Afrika Dünya’nın kuzey ve güney yarıkürelerindeki bir kıta
Q15 en Africa continent on the Earth’s northern and southern hemispheres
Q4412 fr Afrique de l’Ouest région d’Afrique
Q4412 de Westafrika Kontinentalteil
Q4412 tr Batı Afrika Afrika’nın batısındaki 16 ülkenin bulunduğu alan
Q4412 en West Africa region of Africa
Q132959 fr Afrique subsaharienne partie du continent africain au sud du Sahara
Q132959 de Subsahara-Afrika südlich der Sahara gelegener Teil Afrikas
Q132959 tr Sahraaltı Afrika NA
Q132959 en Sub-Saharan Africa area of the continent of Africa that lies south of the Sahara Desert
Q27394 fr Afrique australe région la plus méridionale du continent africain
Q27394 de Südliches Afrika Region in Afrika
Q27394 tr Güney Afrika NA
Q27394 en Southern Africa southernmost region of the African continent
Q27407 fr Afrique de l’Est région d’Afrique
Q27407 de Ostafrika Region in Afrika
Q27407 tr Doğu Afrika NA
Q27407 en East Africa easterly region of the African continent
Q27381 fr Afrique du Nord région en Afrique
Q27381 de Nordafrika Region in Afrika
Q27381 tr Kuzey Afrika Afrika kıtasının Fas, Cezayir, Tunus, Libya, Mısır ve Sudan’ı içeren kuzey bölgesi
Q27381 en North Africa northernmost region of the African continent
Q27433 fr Afrique centrale Région d’Afrique
Q27433 de Zentralafrika Region in Afrika
Q27433 tr Orta Afrika Afrika kıtasının Burundi, Orta Afrika Cumhuriyeti, Çad, Kongo Demokratik Cumhuriyeti ve Ruanda’yı barındıran orta kısmı
Q27433 en Central Africa core region of the African continent
Q258 fr Afrique du Sud pays d’Afrique
Q258 de Südafrika Staat im südlichen Afrika
Q258 tr Güney Afrika Cumhuriyeti Güney Afrika’da bulunan bir ülke
Q258 en South Africa sovereign state in Southern Africa

2.1.8 Extraction of aliases

Now we have to extract the aliases, which are alternative texts corresponding to the same entity in a given language. For example, Q27394, which corresponds to the southern part of Africa (a subregion, not a country), is associated in Spanish with one main label and three equivalent aliases:

item_prop <- get_property("Q27394")[[1]]
  item_prop$labels$es$value
[1] "África austral"
item_prop$aliases$es
  language             value
  1       es África meridional
  2       es    África del Sur
  3       es     sur de África

But in French, no aliases are mentioned:

item_prop$labels$fr$value
[1] "Afrique australe"
item_prop$aliases$fr
NULL

The fact that no aliases are mentioned in French can be considered inconsistent compared to Spanish, and we could certainly imagine adding in French the translations of two of the Spanish aliases: “Afrique méridionale” and “Sud de l’Afrique”. But we cannot add “Afrique du Sud” because in French it refers to the state and not to the subregion.

Although they are not complete, the aliases are certainly a good solution when we want to obtain more efficient dictionaries. For example, if we want to capture the state of South Africa (Q258), we can complete the official label with four aliases in French and three aliases in Spanish, taking into account the fact that the text can be in upper or lower case, with or without accents, etc.
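
Rather than listing every case and accent variant as a separate alias, one can also normalize both the aliases and the corpus before matching. Below is a sketch with a small hand-made accent map; the chartr() table covers only a few French and Spanish characters and is an assumption for illustration, not an exhaustive rule:

```r
# Normalize case and a few accented characters before dictionary matching
normalize_txt <- function(x) {
  x <- tolower(x)
  chartr("àâáäéèêëíîïóôöúùûüç",
         "aaaaeeeeiiiooouuuuc", x)
}

normalize_txt("République d'Afrique du Sud")
normalize_txt("Sudáfrica")
```

With such a normalization, the four French case variants of the Q258 aliases collapse into two, at the price of losing the distinction between accented and unaccented forms.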

item_prop <- get_property("Q258")[[1]]
  item_prop$labels$es$value
[1] "Sudáfrica"
item_prop$aliases$es
  language                  value
  1       es República de Sudáfrica
  2       es              Sudafrica
  3       es Republica de Sudafrica
item_prop$labels$fr$value
[1] "Afrique du Sud"
item_prop$aliases$fr
  language                       value
  1       fr    République sud-africaine
  2       fr République d’Afrique du Sud
  3       fr    république sud-africaine
  4       fr république d’Afrique du Sud
lang="fr"
  is.null(item_prop[["aliases"]][[lang]])!=F
[1] FALSE
ali <- item_prop[["aliases"]][[lang]]$value
  n<-length(ali)
  for (i in 1:n) { print(ali[i])}
[1] "République sud-africaine"
  [1] "République d’Afrique du Sud"
  [1] "république sud-africaine"
  [1] "république d’Afrique du Sud"
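The case-and-accent variants mentioned above can also be generated rather than typed by hand. Below is a minimal base-R sketch (strip_accents and expand_variants are our own hypothetical helpers; the exact output of iconv transliteration is platform dependent):

```r
# Strip accents via transliteration (platform-dependent result)
strip_accents <- function(x) iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")

# Expand a label into its case and accent variants
expand_variants <- function(label) {
  unique(c(label, tolower(label),
           strip_accents(label), tolower(strip_accents(label))))
}

expand_variants("Sudáfrica")
```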

We therefore propose a function called extract_alias which returns, for each entity of interest, the list of labels and aliases in each language. We do not store the definition, which was obtained previously with the function extract_def:

extract_alias <- function(items = c("Q15", "Q258"),
                          langs = c("fr", "de", "en", "tr")) {
  # Create empty dataset
  res <- data.frame(id    = character(),
                    lang  = character(),
                    label = character())

  # Loop over items
  for (i in 1:length(items)) {

    # Extract item properties
    item_prop <- get_property(items[i])[[1]]

    # Loop over languages
    for (j in 1:length(langs)) {
      id   <- items[i]
      lang <- langs[j]
      if (!is.null(item_prop[["labels"]][[lang]]$value)) {
        label <- item_prop[["labels"]][[lang]]$value
      } else {
        label <- NA
      }
      res <- rbind(res, data.frame(id, lang, label))

      # Loop over aliases
      if (!is.null(item_prop[["aliases"]][[lang]])) {
        ali <- item_prop[["aliases"]][[lang]]$value
        for (k in 1:length(ali)) {
          label <- ali[k]
          res <- rbind(res, data.frame(id, lang, label))
        }
      }
    }
  }
  # Export result
  return(res)
}

Let’s try the function on the continent “Africa” (Q15), the subregion “Southern Africa” (Q27394) and the state “South Africa” (Q258) in four languages:

tab<- extract_alias(items = c("Q15", "Q27394", "Q258"),
                langs = c("fr","de","en","tr"))
  kable(tab)
id lang label
Q15 fr Afrique
Q15 de Afrika
Q15 en Africa
Q15 en African continent
Q15 en Ancient Libya
Q15 tr Afrika
Q15 tr Afrika kıtası
Q27394 fr Afrique australe
Q27394 de Südliches Afrika
Q27394 de Südafrika
Q27394 en Southern Africa
Q27394 tr Güney Afrika
Q258 fr Afrique du Sud
Q258 fr République sud-africaine
Q258 fr République d’Afrique du Sud
Q258 fr république sud-africaine
Q258 fr république d’Afrique du Sud
Q258 de Südafrika
Q258 de Suedafrika
Q258 de Republik Südafrika
Q258 en South Africa
Q258 en Republic of South Africa
Q258 en RSA
Q258 en SA
Q258 en za
Q258 en <U+0001F1FF><U+0001F1E6>
Q258 en zaf
Q258 tr Güney Afrika Cumhuriyeti

The function works!
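Before using such a table, it can help to flag labels shared by several entities, like “Südafrika”, which appears above both for the subregion (Q27394) and the state (Q258) in German. A minimal sketch (find_ambiguous is our own hypothetical helper, shown on a toy excerpt of the table):

```r
# Flag labels attached to more than one entity in the same language,
# for manual arbitration.
find_ambiguous <- function(tab) {
  key <- paste(tab$lang, tab$label)
  tab[key %in% key[duplicated(key)], ]
}

tab <- data.frame(id    = c("Q27394", "Q258", "Q258"),
                  lang  = c("de", "de", "de"),
                  label = c("Südafrika", "Südafrika", "Republik Südafrika"))
find_ambiguous(tab)  # the two "Südafrika" rows
```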

2.1.9 Conclusion

It is now possible to develop a global research strategy for the analysis of world regions:

1. Define a set of target regions in one language: in our example, it was based on the use of the term “Afrique” in French, but we can imagine a different list.

2. Identify the codes of the wikidata entities associated with these target regions: the query generally returns many entities of minor interest.

3. Identify the codes of the other wikidata entities that should be added for control: as we have seen, some entities are likely to create confusion and ambiguity in the definition of target entities. These entities will be transformed into compounds or eliminated from the text before searching for the target entities.

4. Extract the properties of the entities in the different languages of interest: this step can be an opportunity to return to step 1, for example if it appears that some subdivisions of Africa are available in English or German but not in French.

5. Compare the definitions of the Wikipedia entities in the different languages: it is important to check whether the assumption of identity of the entities is correct. If not, the entities concerned will be eliminated from the list.

6. Extract the entity-recognition dictionary: this can be done in a multilingual perspective.

The same procedure can obviously be applied to other objects such as states, capital cities, organizations, people, etc.

2.2 Geographical tags (WHERE)

We discuss in this section the steps (automatic or manual) required to create a dictionary of states, applied to the case of the French language.

2.2.2 Extract definitions

We extract the definitions of the regions in one or several languages with the function extract_def().

In our example, we extract the definitions of the target entities in French. This operation can take several minutes because the number of entities is large.

## NOT RUN : needs several minutes !!! ##
wiki_def <- extract_def(ent$wiki, c("fr"))

write.table(x = wiki_def,
            row.names = FALSE,
            file = "data/states_wiki_def.csv",
            fileEncoding = "UTF-8",
            sep = ";")
saveRDS(object = wiki_def, file = "data/states_wiki_def.RDS")
id lang label desc
Q40811 fr Soukhoumi ville de la Géorgie
Q31354462 fr Abkhazie république avec reconnaissance limitée du Caucase
Q5838 fr Kaboul capitale de l’Afghanistan
Q889 fr Afghanistan pays d’Asie centrale
Q3897 fr Luanda capitale de l’Angola
Q916 fr Angola pays d’Afrique du sud-ouest

The dictionary of entities is now available in the target language and we can verify manually that all definitions exist. It can indeed happen that a Wikipedia entity is not defined in one or several languages. In this case, the definition has to be completed manually.
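Spotting the entities that need manual completion is straightforward; here is a sketch (missing_labels is our own hypothetical helper, shown on a toy excerpt of the table):

```r
# List the entities whose label is missing in the target language,
# so they can be completed manually.
missing_labels <- function(def) def$id[is.na(def$label)]

def <- data.frame(id = c("Q40811", "Q889"), label = c("Soukhoumi", NA))
missing_labels(def)  # "Q889"
```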

2.2.3 Extraction of aliases and creation of the wiki dictionary

We will now extract the aliases of each Wikipedia object in our target language. As in the previous case, the operation can take several minutes when the list of entities is long, as is the case here.

## NOT RUN : very long time ...###
  wiki_def <- readRDS("data/states_wiki_def.RDS")
  
  wiki_dict <- extract_alias(wiki_def$id, c("fr"))
  
  write.table(x = wiki_dict,
                row.names = FALSE,
                file = "data/states_wiki_dict.csv",
                fileEncoding = "UTF-8",
                sep = ";")
  saveRDS(object = wiki_dict, file = "data/states_wiki_dict.RDS")

The list of aliases improves the recognition of states, as we can see with the example of Switzerland:

id lang label
702 Q39 fr Confédération helvétique
703 Q39 fr Confédération suisse
704 Q39 fr Helvétie
705 Q39 fr la Confédération suisse
706 Q39 fr Suisse

But the list has to be carefully checked and verified because Wikipedia can introduce many ambiguous terms that produce false positives. In particular, Wikipedia introduces the ISO3 and ISO2 codes of states, which can cause many mistakes. Just consider the case of the ISO2 code of Australia (“AU”), which would create many false positives if we searched for the word in lower case. On the other hand, it is not possible to eliminate aliases based on a minimum number of characters, because we would then eliminate “EU” when looking for the European Union in English news. Finally it appears necessary:

  1. to examine manually the list of aliases and eliminate the ones that are not relevant;
  2. to identify aliases associated with several different entities and decide either to eliminate them or to attach them to a single wiki entity. For example “Singapour” should be attached to either the state name or the city name, not both;
  3. to identify ambiguous cases that cannot be solved automatically and deserve more sophisticated methods. For example, it is impossible to decide automatically whether “Brussels” refers to Belgium or to the European Union in a news item;
  4. to be very careful with the apostrophe, which can take different forms in terms of encoding. We prefer to replace all forms of apostrophe by a blank character, but this can be a matter of debate.
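The apostrophe replacement described in point 4 can be sketched as follows (a design choice, not the only option; normalize_apostrophes is our own helper):

```r
# Replace the ASCII apostrophe, the typographic apostrophe (U+2019)
# and the modifier letter apostrophe (U+02BC) by blanks before tokenization.
normalize_apostrophes <- function(x) gsub("['\u2019\u02BC]", " ", x)

normalize_apostrophes("L’économie de l'Afrique")
# "L économie de l Afrique"
```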

Once the manual work is finished the dictionary of wiki entities is updated with a new version number.

wiki_dict <- read.table("data/states_wiki_dict_V2.csv",
                        header = T,
                        encoding = "UTF-8",
                        sep=";")
  saveRDS(object = wiki_dict, file = "data/states_wiki_dict_V2.RDS")

2.2.4 Creation of a unified dictionary of states

In Wikipedia’s ontology we have kept a distinction between entities corresponding to states’ names and entities corresponding to their capital cities. But it is possible that in our target ontology this distinction no longer exists, and what we are looking for is a unified dictionary of states linked to an ISO3 code in order to produce maps. In this case, we have to build a dictionary of states that merges state names and capital cities. To achieve this task we merge our initial table of state codes with the dictionary of wiki entities and we eliminate the duplicates.

ent<-read.table("data/states_codes.csv", 
                  sep=";",
                  header=T,
                  encoding = "UTF-8")
  ent<-ent[,c(1,3,4,5)]
  
wiki_dict <- read.table("data/states_wiki_dict_V2.csv",
                        header = T,
                        encoding = "UTF-8",
                        sep=";")
  geo_dict<-merge(wiki_dict, ent, by.x="id",by.y="wiki",all.x=T,all.y=F)
  geo_dict<-unique(geo_dict)
  geo_dict<-geo_dict[order(geo_dict$iso3),]
  
  write.table(x = geo_dict,
                row.names = FALSE,
                file = "data/states_geo_dict.csv",
                fileEncoding = "UTF-8",
                sep = ";")
  saveRDS(object = geo_dict, file = "data/states_geo_dict.RDS")

We have finally obtained a dictionary where each textual entity in the column “label” is associated with three different codes:

  • the wikidata entity (id)
  • a code combining the type of entity (capital city or country name) with the state (e.g. CA_BEL, NA_BEL)
  • the ISO3 code of the state

It is therefore possible to use different strategies of states recognition in the analysis that will be further developed. Consider for example the case of Belgium :

kable(geo_dict[geo_dict$iso3=="BEL",])
id lang label type code iso3
453 Q239 fr Bruxelles capital_city CA_BEL BEL
454 Q239 fr Bruxelles-Ville capital_city CA_BEL BEL
455 Q239 fr ville de Bruxelles capital_city CA_BEL BEL
543 Q31 fr Belg. country_name NA_BEL BEL
544 Q31 fr Belgique country_name NA_BEL BEL
545 Q31 fr Royaume de Belgique country_name NA_BEL BEL

We can decide to recognize the country only by its name, or combine both criteria, or keep all labels except the first one, since “Bruxelles” is often a metonym for the EU …
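Selecting one of these strategies amounts to filtering the dictionary on the code prefix (NA_ for country names, CA_ for capital cities, as in the Belgium example above). A sketch with a toy excerpt (only_countries is our own hypothetical helper):

```r
# Keep only country-name labels, relying on the "NA_" prefix
# of the "code" column.
only_countries <- function(dict) dict[startsWith(dict$code, "NA_"), ]

geo <- data.frame(label = c("Bruxelles", "Belgique", "Royaume de Belgique"),
                  code  = c("CA_BEL", "NA_BEL", "NA_BEL"),
                  iso3  = c("BEL", "BEL", "BEL"))
only_countries(geo)$label  # "Belgique" "Royaume de Belgique"
```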

2.2.5 Extract tags function

We have written a function for the extraction of geographical units based on the dictionary elaborated in the previous section (dict). Its parameters specify the language (lang), the characters on which tokens are split (split), whether to move to lower case (tolow), and a list of compounds to be created (comps) in order to eliminate ambiguities.

extract_tags <- function(qd = qd,                   # the corpus of interest
                         lang = "fr",               # the language to be used
                         dict = dict,               # the dictionary of targets
                         code = "id",               # variable used for coding
                         name = "tags",             # name of the tag column
                         split = c("'", "’", "-"),  # characters to split on
                         tolow = FALSE,             # move to lower case?
                         comps = c("Afrique du sud")  # compounds
                         )
{
  reg <- paste(split, collapse = "|")

  # Tokenize
  x <- as.character(qd)
  if (length(split) > 0) { x <- gsub(reg, " ", x) }
  if (tolow) { x <- tolower(x) }
  toks <- tokens(x)

  # Compounds
  if (length(split) > 0) { comps <- gsub(reg, " ", comps) }
  if (tolow) { comps <- tolower(comps) }
  toks <- tokens_compound(toks, pattern = phrase(comps))

  # Target dictionary
  dict <- dict[dict$lang == lang & !is.na(dict$label), ]
  labels <- dict$label
  if (length(split) > 0) { labels <- gsub(reg, " ", labels) }
  if (tolow) { labels <- tolower(labels) }
  toks <- tokens_compound(toks, pattern = phrase(labels))

  # Create quanteda dictionary
  keys <- gsub(" ", "_", labels)
  qd_dict <- as.list(keys)
  names(qd_dict) <- dict[[code]]
  qd_dict <- dictionary(qd_dict, tolower = FALSE)

  # Identify geo tags (states, regions, organizations, ...)
  toks_tags <- tokens_lookup(toks, qd_dict, case_insensitive = FALSE)
  toks_tags <- as.tokens(lapply(toks_tags, unique))
  list_tags <- function(x) { paste(x, collapse = " ") }
  docvars(qd)[[name]] <- as.character(lapply(toks_tags, FUN = list_tags))
  docvars(qd)[[paste("nb_", name, sep = "")]] <- ntoken(toks_tags)

  # Export results
  return(qd)
}

The function looks rather complex but its application is relatively simple. Let us apply it to the corpus of news collected in the first part.
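The core mechanics (compounding multi-word labels, then looking tokens up in the dictionary) can be illustrated in base R without quanteda. This is a simplified sketch of the idea, not the actual implementation; tag_states is our own hypothetical helper:

```r
# Toy illustration of compound-then-lookup tagging.
tag_states <- function(text, dict) {
  # Compound multi-word labels first, longest first to avoid partial hits
  dict <- dict[order(-nchar(dict$label)), ]
  for (lab in dict$label)
    text <- gsub(lab, gsub(" ", "_", lab), text, fixed = TRUE)
  # Tokenize on anything that is not alphanumeric or underscore
  toks <- unlist(strsplit(text, "[^[:alnum:]_]+"))
  keys <- gsub(" ", "_", dict$label)
  unique(dict$code[match(toks, keys, nomatch = 0)])
}

dict <- data.frame(label = c("Belgique", "Afrique du Sud"),
                   code  = c("BEL", "ZAF"))
tag_states("La Belgique et l Afrique du Sud", dict)
```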

2.2.6 Identification of foreign states

Here we decide to use the ISO3 code and to create a column of tags listing the states mentioned in each news item, plus a column giving the number of states found. For the moment we keep the country where the newspaper is located in the list of identified countries, but if we want to focus on foreign countries it will be possible to eliminate it later.
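Eliminating the home country later could look like this (drop_home is our own hypothetical helper; here the home country is TUN, since the newspaper is Tunisian):

```r
# Remove the newspaper's home country from a space-separated tag list.
drop_home <- function(states, home = "TUN") {
  setdiff(unlist(strsplit(states, " ")), home)
}

drop_home("TUN FRA DEU")  # "FRA" "DEU"
```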

dict <- readRDS("data/states_geo_dict.RDS")

qd <- readRDS("data/fr_TUN_ecomag.RDS")

docvars(qd) <- docvars(qd)[c("source", "date")]
qd <- corpus_subset(qd, duplicated(as.character(qd)) == FALSE)

frcomps <- c("France Inter", "France Info", "France Soir",
             "Bourse de Paris", "Paris SG", "Ville de Paris", "Grand Paris")

qd <- extract_tags(qd = qd,
                   lang = "fr",
                   dict = dict,
                   code = "iso3",
                   name = "states",
                   split = c("'", "’", "-"),
                   comps = frcomps,
                   tolow = FALSE)

saveRDS(qd, "data/fr_TUN_ecomag_geo.RDS")

table(qd$nb_states)

      0     1     2     3     4 
  10229  2229   298    10     2 

2.2.7 Validation of results

For the validation of results, we extract from the quanteda object a table that contains only the text of the news, the tags and the number of tags. You can see below the news items where the largest number of foreign states has been found:

check<-data.frame(text=as.character(qd),states=qd$states,nb_states=qd$nb_states)
  check<-check[order(check$nb_states,decreasing = T),]
  kable(check[1:10,])
text states nb_states
1672551765 Coronavirus : La Belgique, l’Espagne, la Suisse et la Chine retirées de la liste verte BEL ESP CHE CHN 4
1258830712 Mondial 2022 à 48 pays: vers une co-organisation Qatar-Koweit- Oman ? QAT KWT OMN 3
1557214307 Le Canada, l’Australie, la Norvège et la GB boycottent les JO de Tokyo AUS NOR JPN 3
1803383307 Suspension de tous les vols à l’aller, au retour et en transit entre les aéroports tunisiens et ceux du Royaume-Uni, de l’Afrique du Sud et de l’Australie GBR ZAF AUS 3
1942722871 Route transsaharienne : Alger et Tunis bientôt reliées à Bamako, Niamey, N’Djamena et Lagos DZA MLI NER 3
2032791343 Héla Cheikhrouhou nommée vice-présidente régionale de l’IFC au Moyen-Orient, en Asie centrale, en Turquie, en Afghanistan et au Pakistan TUR AFG PAK 3
1285190782 Azza Besbes (Escrime) : mon objectif ? Le podium aux J-O de Tokyo 2020 PSE JPN 2
1298232030 IMD : Singapour classée économie la plus compétitive au monde devant les USA SGP USA 2
1304454807 Attractivité économique 2019 : La France dépasse l’Allemagne selon l’étude d’EY FRA DEU 2
1311968174 USA: 500 entreprises sollicitent Trump de trouver un accord avec la Chine USA CHN 2

Looking at this table, it is possible to check:

  • the existence of false positives, i.e. countries that have been identified but are not actually present in the news. For example, we notice a false identification of the Palestinian territories in news 1285190782, due to the name Azza, which Wikipedia added to the dictionary as a label of the city of Gaza.

  • the existence of false negatives, i.e. countries that are present in the news according to an expert reader but have not been identified. For example, Canada has not been identified in news 1557214307, which is rather mysterious since the label Canada is present in the dictionary; it may be due to the presence of an invisible character. More understandable is the fact that Nigeria is not identified in news 1942722871, because Lagos is not the capital city of this country and is therefore not present in the dictionary.

Once again, a manual analysis of the results appears necessary and will certainly require repeating the last stage of the pipeline several times until a sufficiently efficient dictionary is obtained.
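Once a sample of news has been annotated manually, this iterative checking can be quantified. A sketch (precision_recall is our own helper, and the gold annotations are a hypothetical example based on the table above):

```r
# Compare automatic tags with a manual gold standard to estimate
# precision and recall.
precision_recall <- function(pred, gold) {
  tp <- length(intersect(pred, gold))
  c(precision = tp / length(pred), recall = tp / length(gold))
}

# News 1285190782: the tagger found PSE and JPN, the expert only JPN
precision_recall(pred = c("PSE", "JPN"), gold = "JPN")
# precision 0.5, recall 1
```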

2.3 Topical tags (WHAT)

2.3.1 Brexit means Brexit!

The simplest case is a topic that can be defined by a single word. A good example is the identification of news where the word “Brexit” is present. Translation problems are limited here because most languages use the same word as English, or a single equivalent.

We can therefore easily create a dictionary with the minimum of columns required:

code =c("brex","brex")
  label=c("brexit","Brexit")
  lang = c("fr","fr")
  dict <- data.frame(code,label,lang)
  dict
  code  label lang
  1 brex brexit   fr
  2 brex Brexit   fr

Then we apply the function extract_tags that we have seen in the previous section :

qd <- readRDS("data/fr_TUN_ecomag.RDS")
  
  qd<-corpus_subset(qd,duplicated(as.character(qd))==FALSE)
  
  frcomps<-c("France Inter", "France Info","France Soir",
             "Bourse de Paris", "Paris SG", "Ville de Paris", "Grand Paris")
  
  qd <- extract_tags (qd = qd,
                       lang="fr",
                       dict = dict,
                       code = "code",
                       name = "brex",
                       split = c("'","’","-"),
                       comps = frcomps,
                       tolow = FALSE)
  
  table(qd$nb_brex)

      0     1 
  12761     7 

The results are rather disappointing, as we find only 7 news items related to Brexit in our corpus. We can have a look at them, but it is not worth storing the tags:

w<-tidy(corpus_subset(qd,nb_brex>0))
  kable(w)
text source date brex nb_brex
OMC : les effets du Brexit dépendront de l’accord auquel pourraient parvenir le Royaume-Uni et l’UE fr_TUN_ecomag 2019-04-03 brex 1
Brexit: Amal Clooney démissionne et juge «lamentable» de revenir sur l’accord fr_TUN_ecomag 2020-09-19 brex 1
Brexit : entame des procédures de « divorce » avec l’Union européenne fr_TUN_ecomag 2020-12-09 brex 1
Accord post-Brexit: les bourses européennes se réjouissent fr_TUN_ecomag 2020-12-25 brex 1
Brexit : l’accord de continuité commerciale entre la Tunisie et le Royaume-Uni entrera en vigueur le 1er janvier 2021 fr_TUN_ecomag 2020-12-30 brex 1
Brexit : La chambre des communes approuve l’accord fr_TUN_ecomag 2020-12-30 brex 1
L’Europe et le Brexit fr_TUN_ecomag 2021-01-04 brex 1

2.3.2 The Covid-19/Coronavirus crisis

Considering the period of observation (2019-2021), we can expect more interesting results if we choose the Covid-19/coronavirus pandemic as the target of our analysis. Here the creation of the dictionary is a bit more complex because the pandemic has been designated by different words or groups of words, with changes through time. We therefore need to analyze the texts carefully before elaborating the dictionary. Let’s start with a set of three words and their spelling variants:

code =c("cov","cov","cov","cov","cov","cov","cov")
  label=c("covid","Covid", "Covid-19", "covid-19","coronavirus", "Coronavirus","corona virus")
  lang = c("fr","fr","fr","fr","fr","fr","fr")
  dict <- data.frame(code,label,lang)
  kable(dict)
code label lang
cov covid fr
cov Covid fr
cov Covid-19 fr
cov covid-19 fr
cov coronavirus fr
cov Coronavirus fr
cov corona virus fr
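The upper-case variants can also be generated programmatically instead of being typed one by one (needed because tolow = FALSE in the calls below). A sketch, where cap1 is our own helper and dict2 a superset of the hand-typed dictionary above:

```r
# Capitalize the first letter of each label (vectorized)
cap1 <- function(x) paste0(toupper(substr(x, 1, 1)), substring(x, 2))

base  <- c("covid", "covid-19", "coronavirus", "corona virus")
dict2 <- data.frame(code  = "cov",
                    label = unique(c(base, cap1(base))),
                    lang  = "fr")
dict2$label
```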

Then we apply the function extract_tags that we have seen in the previous section :

qd <- readRDS("data/fr_TUN_ecomag.RDS")
  
  qd<-corpus_subset(qd,duplicated(as.character(qd))==FALSE)
  
  frcomps<-c("France Inter", "France Info","France Soir",
             "Bourse de Paris", "Paris SG", "Ville de Paris", "Grand Paris")
  
  qd <- extract_tags (qd = qd,
                       lang="fr",
                       dict = dict,
                       code = "code",
                       name = "cov",
                       split = c("'","’","-"),
                       comps = frcomps,
                       tolow = FALSE)
  
  table(qd$nb_cov)

      0     1 
  11492  1276 

Now more than 10% of the news items are related to the pandemic, and we can have a look at the first ones published in the newspaper at the beginning of the crisis:

w<-tidy(corpus_subset(qd,nb_cov>0))
  w<-w[order(w$date),]
  kable(head(w,20))
text source date cov nb_cov
Aucun cas de coronavirus n’a été enregistré en Tunisie fr_TUN_ecomag 2020-01-27 cov 1
Chine : le premier hôpital dédié aux coronavirus ouvre ses portes après 48h de travaux fr_TUN_ecomag 2020-01-31 cov 1
Tunis : L’ambassade de Chine appelle à ne pas exagérer les faits du coronavirus fr_TUN_ecomag 2020-02-03 cov 1
Coronavirus-Alger : dix Tunisiens évacués à bord d’un avion algérien fr_TUN_ecomag 2020-02-03 cov 1
Le Remdesivir : un traitement contre le Covid-19 à l’essai fr_TUN_ecomag 2020-02-19 cov 1
Coronavirus en Italie : annulations en cascades de manifestations sportives fr_TUN_ecomag 2020-02-25 cov 1
Coronavirus : éviter le port des masques pour les personnes non contaminées fr_TUN_ecomag 2020-02-27 cov 1
Coronavirus : les analyses effectuées sur la citoyenne de retour de Milan sont négatives fr_TUN_ecomag 2020-02-28 cov 1
Coronavirus : le plus grand rendez-vous touristique dans le monde est annulé fr_TUN_ecomag 2020-02-29 cov 1
Nabil Bziouech : le tourisme tunisien aurait vécu une catastrophe si le coronavirus s’était déclenché fr_TUN_ecomag 2020-03-02 cov 1
Coronavirus : des cellules de suivi des TRE créées en Italie fr_TUN_ecomag 2020-03-02 cov 1
Coronavirus – Italie : aucun cas de contamination parmi les Tunisiens fr_TUN_ecomag 2020-03-02 cov 1
Coronavirus : comment le test de dépistage fonctionne-t-il ? fr_TUN_ecomag 2020-03-03 cov 1
Marché actions et Coronavirus : que faire? fr_TUN_ecomag 2020-03-03 cov 1
La Banque Mondiale octroie jusqu’à 12 milliards de dollars d’aide rapide contre le coronavirus fr_TUN_ecomag 2020-03-04 cov 1
Gabès – Covid-19 : isolement à domicile de trois personnes de retour d’Italie fr_TUN_ecomag 2020-03-05 cov 1
Coronavirus : les compétitions africaines menacées, les JO de Tokyo maintenus fr_TUN_ecomag 2020-03-08 cov 1
Coronavirus : la conversion à l’Islam suspendue en Tunisie fr_TUN_ecomag 2020-03-09 cov 1
Coronavirus : contagion économique et impacts sur la Tunisie fr_TUN_ecomag 2020-03-09 cov 1
7e cas confirmé de coronavirus en Tunisie fr_TUN_ecomag 2020-03-11 cov 1

The chronology of these news items defines an interesting storyline that deserves a qualitative analysis. But we can also take a quantitative view of the proportion of news related to the pandemic during the period of observation.

w<-docvars(qd)
  chrono <-data.table(w)
  chrono$week = cut(chrono$date,breaks = "week")
  chrono$pandemic = as.factor(chrono$nb_cov !=0)
  levels(chrono$pandemic)<-c("No","Yes")
  chrono <- chrono[,list(nb=.N),list(week,pandemic)]
  chrono <- dcast(chrono, formula = week~pandemic, value.var="nb",fill = 0) 
  chrono$tot <-chrono$No+chrono$Yes
  chrono$pct<-100*chrono$Yes/chrono$tot
  chrono$week<-as.Date(chrono$week)
  plot(chrono$week,chrono$pct,
        type="l",col="red",lwd=1,
        xlab= "Time distribution by week",
        ylab = "% of news",
        main= "Share of news related to the topic",
       sub = "source : Mediacloud")

Bibliographie

BARNIER, Julien, 2021. rmdformats: HTML Output Formats and Templates for ’rmarkdown’ Documents [online]. S.l.: s.n. Available at: https://github.com/juba/rmdformats.
R CORE TEAM, 2020. R: A Language and Environment for Statistical Computing [online]. Vienna, Austria: R Foundation for Statistical Computing. Available at: https://www.R-project.org/.
XIE, Yihui, 2020. knitr: A General-Purpose Package for Dynamic Report Generation in R [online]. S.l.: s.n. Available at: https://CRAN.R-project.org/package=knitr.

Annexes

Session info

setting value
version R version 4.1.0 (2021-05-18)
os Windows 10 x64
system x86_64, mingw32
ui RTerm
language (EN)
collate French_France.1252
ctype French_France.1252
tz Europe/Paris
date 2021-11-26
package ondiskversion source
dplyr 1.0.6 CRAN (R 4.1.0)
ggplot2 3.3.3 CRAN (R 4.1.0)
knitr 1.34 CRAN (R 4.1.1)
quanteda 3.0.0 CRAN (R 4.1.0)
readtext 0.80 CRAN (R 4.1.0)
rmarkdown 2.11 CRAN (R 4.1.1)
rzine 0.1.0 gitlab ()
tidytext 0.3.1 CRAN (R 4.1.1)
WikidataR 2.3.1 CRAN (R 4.1.1)

Citation

@Manual{ficheRzine,
    title = {Titre de la fiche},
    author = {{Auteur.e.s}},
    organization = {Rzine},
    year = {202x},
    url = {http://rzine.fr/},
  }


Glossary